Project presentation

Ander Barrio Campos (231938), Dionysios Dimitreas (s232752), Erikas Mikužis (s223164), Valeria Tedeschi (s231945), Angeliki Vliora (233059)

Introduction

Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is

  • readable
  • maintainable
  • reproducible

Packages used: ggplot2, dplyr, tidyr, and readr (core tidyverse), plus broom

Our Dataset

Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.

Key Features: Health indicators related to diabetes, including:

  • lifestyle factors
  • health outcomes
  • demographic information


Introduction

Research Questions

  1. What are the key predictive variables in diabetes prognosis?

  2. How does gender influence the manifestation and progression of diabetes?

Materials and Methods

Data Cleaning and Augmentation

Data Cleaning

  • Removed Missing Values: df_cleaned <- df |> drop_na()

  • Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))

  • Filtered Incorrect Values: removed rows with values outside the expected ranges.
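
A minimal sketch of the cleaning steps above, run on an invented toy table. The column name Diabetes_binary and the valid ranges (BMI between 10 and 100, GenHlth on a 1 to 5 scale) are assumptions for illustration, not the thresholds used in the project.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the BRFSS table; all values are invented.
df <- data.frame(
  Diabetes_binary = c(0, 1, 0, 1, 0),
  BMI             = c(27, 31, NA, 250, 22),
  GenHlth         = c(2, 4, 3, 5, 9)
)

df_cleaned <- df |>
  drop_na() |>                      # remove rows with any missing value
  filter(
    Diabetes_binary %in% c(0, 1),   # target must be binary
    between(BMI, 10, 100),          # assumed plausible BMI range
    between(GenHlth, 1, 5)          # assumed 1-5 rating scale
  )

# One class per column, returned as a named character vector
column_types <- sapply(df_cleaned, class)
```

Keeping every range check inside a single filter() call makes the accepted ranges easy to review in one place.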

Data Augmentation

  • Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).

  • Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.

  • Socio-Economic Class: Derived from income, education, and healthcare status.
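
The augmentation steps above could look roughly like the sketch below. The scoring rule behind Habits is hypothetical (the presentation does not state how the variable was derived), and the columns PhysActivity and Fruits are assumed inputs; only the level names Healthy and Average appear in the results.

```r
library(dplyr)

# Toy data; all values are invented.
df_cleaned <- data.frame(
  Smoker       = c(0, 1, 1, 0),
  PhysActivity = c(1, 0, 1, 0),
  Fruits       = c(1, 0, 1, 1)
)

df_aug <- df_cleaned |>
  mutate(
    # Binary indicator recoded as a readable category
    Smoking_Status = if_else(Smoker == 1, "Smoker", "Non-smoker"),
    # Hypothetical score: one point per favourable habit
    habit_score = PhysActivity + Fruits + (1 - Smoker),
    Habits = case_when(
      habit_score == 3 ~ "Healthy",
      habit_score == 2 ~ "Average",
      TRUE             ~ "Unhealthy"
    )
  )
```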

Data Analysis

Correlation

  • Between all variables: health-related variables are correlated with one another. No variables are strongly negatively correlated, although GenHlth and Income show a negative correlation.

  • With the target variable: GenHlth, HighBP, and BMI are the variables most correlated with the diabetes variable.
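
As an illustration of the two correlation views above, the sketch below computes a full correlation matrix and then ranks variables by their absolute correlation with the target. The data are invented and the target name Diabetes_binary is an assumption.

```r
# Toy data; all values are invented.
df <- data.frame(
  Diabetes_binary = c(0, 0, 0, 1, 1, 1, 0, 1),
  GenHlth         = c(1, 2, 2, 4, 5, 3, 1, 4),
  HighBP          = c(0, 0, 1, 1, 1, 1, 0, 0),
  Income          = c(8, 7, 6, 3, 2, 5, 8, 4)
)

# Pairwise Pearson correlations between all variables
cor_mat <- cor(df)

# Correlation of each predictor with the target, strongest first
target_cor <- cor_mat[, "Diabetes_binary"]
target_cor <- target_cor[names(target_cor) != "Diabetes_binary"]
ranked <- sort(abs(target_cor), decreasing = TRUE)
```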

GLM

  • All variables: fitted a GLM using all numerical variables.

  • Stepwise selection: ran forward and backward stepwise selection to choose the best variables.

  • Results: the backward model achieved the lowest AIC and contains 19 variables. The variables excluded from the full model are Smoker, AnyHealthcare, NoDocbcCost, and Education_binary.
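
The selection procedure above can be sketched with base R's glm() and step(). This is an illustration on simulated data, assuming a binomial (logistic) family and a target named Diabetes_binary; neither detail is confirmed by the presentation.

```r
set.seed(42)
n <- 400

# Simulated predictors; Smoker has no true effect on the outcome.
df <- data.frame(
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.4),
  BMI     = round(rnorm(n, 28, 5)),
  Smoker  = rbinom(n, 1, 0.45)
)
lin <- -6 + 0.6 * df$GenHlth + 0.8 * df$HighBP + 0.08 * df$BMI
df$Diabetes_binary <- rbinom(n, 1, plogis(lin))

full_model <- glm(Diabetes_binary ~ GenHlth + HighBP + BMI + Smoker,
                  data = df, family = binomial)
null_model <- glm(Diabetes_binary ~ 1, data = df, family = binomial)

# Backward: start from the full model and drop terms while AIC improves
backward <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the intercept and add terms while AIC improves
forward <- step(null_model, direction = "forward",
                scope = list(lower = ~ 1,
                             upper = ~ GenHlth + HighBP + BMI + Smoker),
                trace = 0)

aic_table <- AIC(full_model, backward, forward)
```

Because all three models are fitted to the same rows, their AIC values in aic_table are directly comparable.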

Data Analysis

PCA + Logistic Regression

  • Selected components: the first 15 components, which together explain 80% of the variance.

  • Logistic regression: used those components as predictors in a diabetes prediction model.

  • Results: the model reached an accuracy of 87%.
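
A rough sketch of the PCA-plus-logistic-regression pipeline described above, using base R's prcomp() and glm() on simulated data. The 80% threshold comes from the slide; the 0.5 classification cutoff and the variable names are assumptions.

```r
set.seed(7)
n <- 300
health <- rnorm(n)   # hypothetical latent factor driving the toy variables

X <- data.frame(
  GenHlth  = health + rnorm(n, sd = 0.5),
  PhysHlth = health + rnorm(n, sd = 0.7),
  BMI      = 0.5 * health + rnorm(n),
  Income   = -0.4 * health + rnorm(n)
)
y <- rbinom(n, 1, plogis(-1 + 1.2 * health))

# PCA on standardised variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches 80%
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(explained >= 0.80)[1]

# Logistic regression on the retained component scores
scores <- as.data.frame(pca$x[, 1:k, drop = FALSE])
scores$Diabetes_binary <- y
fit <- glm(Diabetes_binary ~ ., data = scores, family = binomial)

# In-sample accuracy at an assumed 0.5 cutoff
pred <- as.integer(predict(fit, type = "response") > 0.5)
accuracy <- mean(pred == scores$Diabetes_binary)
```

The accuracy computed here is in-sample; evaluating on held-out rows would give a more conservative estimate.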

New GLMs

  • Men vs. women: split the data into two datasets by sex and fitted a GLM to each.

  • Results: the men's model has the lower AIC and therefore performs better. General health variables and the fruit variable carry more importance. Performance is much better than that of the GLM from the first part of the analysis.
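
One way to fit the sex-specific models is to split the data and apply the same formula to each subset, as sketched below on simulated data. The coding of Sex (0 = female, 1 = male), the formula, and the column names are assumptions.

```r
set.seed(3)
n <- 600

df <- data.frame(
  Sex     = rbinom(n, 1, 0.45),             # assumed coding: 0 = female, 1 = male
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.4),
  Fruits  = rbinom(n, 1, 0.6)
)
lin <- -3 + 0.6 * df$GenHlth + 0.8 * df$HighBP - 0.3 * df$Fruits
df$Diabetes_binary <- rbinom(n, 1, plogis(lin))

# One dataset per sex
by_sex <- split(df, ifelse(df$Sex == 1, "men", "women"))

# Same model specification fitted to each subset
models <- lapply(by_sex, function(d) {
  glm(Diabetes_binary ~ GenHlth + HighBP + Fruits,
      data = d, family = binomial)
})

aic_by_sex <- sapply(models, AIC)    # one AIC per model
coef_table <- sapply(models, coef)   # coefficients side by side
```

AIC depends on the number of observations, so values from models fitted to different subsets are best read alongside the subset sizes and the coefficient table rather than on their own.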

Results

  • Smoking status affects mostly the youngest age groups
  • Similar behavior for other age groups
  • BMI tends to increase with age

  • Among Healthy individuals, women have healthier habits
  • More non-diabetic women than men have an Average lifestyle; the opposite holds among diabetic individuals.

Results

Key Findings:

  • Correlation Analysis:
    • Health-related variables positively correlated with diabetes.
    • Comparatively strong positive correlation between PhysHlth and GenHlth.
    • Negative correlation between GenHlth and Income.
  • GLM Results:
    • GenHlth, HighBP, and BMI are top predictors for diabetes.
    • Backward selection model excludes Smoker, AnyHealthcare, NoDocbcCost, and Education_binary.
    • Better health indicators are associated with a lower likelihood of diabetes.

Results

Key Findings:

  • PCA Analysis:
    • Clear distinction in diabetes status across principal components.
    • Top components capture significant variance in health-related variables.
  • Logistic Regression Model:
    • High accuracy in predicting diabetes using principal components.
    • Confirms the strong link between health indicators and diabetes.
  • Gender-Based GLM Analysis:
    • The importance of the predictive variables differs between the men’s and women’s models.
    • Suggests gender-specific approaches might be more effective in diabetes prediction.

Discussion

Concluding Thoughts

  • Key Predictive Variables in Diabetes Prognosis:
    • GenHlth, HighBP, and BMI emerged as significant predictors.
    • Highlights the interplay of overall health and specific medical indicators in diabetes risk.
  • Gender’s Influence on Diabetes:
    • Different predictive variables for men and women indicate a gender-specific impact.
    • Suggests tailored approaches in diabetes care based on gender.